Packages

library(tidyverse)
library(DAAG)
library(data.table)
library(openxlsx)
library(leaps)
library(reshape2)
library(car)
library(corrplot)
library(gtable)
require(cowplot)
library('readxl')
library(zoo)

Problem 1

For this problem we follow the study by Lalonde (1986) on earnings related to various factors such as training, race, gender, etc. The data are available directly from the DAAG package (the data file is nsw74psid1). Your task will be to fit a multiple regression model of the form yˆ = β0 + β1x1 + β2x2 + β3x3 + ···, where yˆ = real earnings in 1978 (re78), and for the predictors, you will need to decide which ones to keep. The complete assignment needs to be typed, include all the plots, and the R source code as well.

Load data

df <- nsw74psid1
attach(df)
head(df)
##   trt age educ black hisp marr nodeg re74 re75 re78
## 1   0  47   12     0    0    0     0    0    0    0
## 2   0  50   12     1    0    1     0    0    0    0
## 3   0  44   12     0    0    0     0    0    0    0
## 4   0  28   12     1    0    1     0    0    0    0
## 5   0  54   12     0    0    1     0    0    0    0
## 6   0  55   12     0    1    1     0    0    0    0
summary(df)
##       trt               age             educ           black       
##  Min.   :0.00000   Min.   :17.00   Min.   : 0.00   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:25.00   1st Qu.:10.00   1st Qu.:0.0000  
##  Median :0.00000   Median :32.00   Median :12.00   Median :0.0000  
##  Mean   :0.06916   Mean   :34.23   Mean   :11.99   Mean   :0.2916  
##  3rd Qu.:0.00000   3rd Qu.:43.50   3rd Qu.:14.00   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :55.00   Max.   :17.00   Max.   :1.0000  
##       hisp              marr            nodeg             re74       
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :     0  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:0.0000   1st Qu.:  8817  
##  Median :0.00000   Median :1.0000   Median :0.0000   Median : 17437  
##  Mean   :0.03439   Mean   :0.8194   Mean   :0.3331   Mean   : 18230  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 25470  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :137149  
##       re75             re78       
##  Min.   :     0   Min.   :     0  
##  1st Qu.:  7605   1st Qu.:  9243  
##  Median : 17008   Median : 19432  
##  Mean   : 17851   Mean   : 20502  
##  3rd Qu.: 25584   3rd Qu.: 28816  
##  Max.   :156653   Max.   :121174

(a) Plot a histogram of each variable and discuss its properties.

Binary variables
b1 <- ggplot(data=df) +
  geom_bar(mapping=aes(x=black), fill='#3796db') +
  ggtitle('Black') 

b2 <- ggplot(data=df) +
  geom_bar(mapping=aes(x=hisp), fill='#3796db') +
  ggtitle('Hispanic')

b3 <- ggplot(data=df) +
  geom_bar(mapping=aes(x=marr), fill='#3796db') +
  ggtitle('Married')

b4 <- ggplot(data=df) +
  geom_bar(mapping=aes(x=nodeg), fill='#3796db') +
  ggtitle('No Degree')

plot_grid(b1, b2, b3, b4, nrow=2)

These binary variables each show whether a participant is or is not some characteristic. We can see that, for the most part, this sample of data is not even with respect to these characteristics These observations are weighted towards non black, non hispanic, married, and no degree participants (in relation to each variables other value). This could present an issue with making generalizations to the population because the sample is disproportionately weighted in these particular value frequencies.

Continuous variables - Age and Education
c1 <- ggplot(data=df) +
  geom_histogram(mapping=aes(x=age), binwidth=1, fill='#3796db') +
  ggtitle("Age Distribution")

c2 <- ggplot(data=df) +
  geom_histogram(mapping=aes(x=educ), binwidth=, fill='#3796db') +
  ggtitle("Years of Education Distribution")

plot_grid(c1, c2, nrow=1)

The age distribution tells us that most observations are people between early twenties and thirties, and then between 40 and 50 years of age. This distribution has two peaks, so it is not normally distributed and we should keep this in mind for average metrics or summarizing results. The years of education distribution tells us that for any single value, 12 years is the highest proportion. This makes sense because this is the mandatory 12 years of education from finishing high school. There is a longer tail on the left than the right, which suggests that there is more variation to having less than a high school degree than there is in having more.

Continuous variables - Real Earnings
c3 <- ggplot(data=df) +
  geom_histogram(mapping=aes(x=re74), fill='#3796db', bins=40) +
  xlim(0, 150000) +
  ggtitle('Real Earnings 1974') +
  xlab(NULL)

c4 <- ggplot(data=df) +
  geom_histogram(mapping=aes(x=re75), fill='#3796db', bins=40) +
  xlim(0, 150000) +
  ggtitle('Real Earnings 1975') +
  xlab(NULL)

c5 <- ggplot(data=df) +
  geom_histogram(mapping=aes(x=re78), fill='#3796db', bins=40) +
  xlim(0, 150000) +
  ggtitle('Real Earnings 1978') +
  xlab(NULL)

plot_grid(c3, c4, c5, ncol=1)

The real earnings variables all show somewhat similar distributions with a positive skew. These are characteristic of most income distributions we would expect, because a small percentage of people may earn significantly higher salaries with wider ranges than most of the population.

(b) Estimate your full regression model yˆ = β0 + β1x1 + β2x2 + β3x3 + and show/discuss your results.

linreg <- lm(re78~trt+(age+educ+re74+re75)+(black+hisp+marr+nodeg))
summary(linreg)
## 
## Call:
## lm(formula = re78 ~ trt + (age + educ + re74 + re75) + (black + 
##     hisp + marr + nodeg))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64870  -4302   -435   3786 110412 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -129.74276 1688.51706  -0.077   0.9388    
## trt          751.94643  915.25723   0.822   0.4114    
## age          -83.56559   20.81380  -4.015 6.11e-05 ***
## educ         592.61020  103.30278   5.737 1.07e-08 ***
## re74           0.27812    0.02792   9.960  < 2e-16 ***
## re75           0.56809    0.02756  20.613  < 2e-16 ***
## black       -570.92797  495.17772  -1.153   0.2490    
## hisp        2163.28118 1092.29036   1.981   0.0478 *  
## marr        1240.51952  586.25391   2.116   0.0344 *  
## nodeg        590.46695  646.78417   0.913   0.3614    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10070 on 2665 degrees of freedom
## Multiple R-squared:  0.5864, Adjusted R-squared:  0.585 
## F-statistic: 419.8 on 9 and 2665 DF,  p-value: < 2.2e-16

Here we have the summary of our initial linear regression model with all coefficients included. Right away, we can notice some abnormal numbers. The binary variables all have large coefficients, standard errors, and t values and p values that would suggest that they are not statistically significant at a 95% level. This could be explained with what we saw earlier from the disproportionate class membership for binary variables. On the other hand, age, educ, re74, and re75 show smaller standard errors and t values and p values that suggest they should be kept in a model. The Adjusted R2 score is 0.585, which seems reasonable for right now. The F-statistic strongly indicates that some of these variables are significant.

(c) Compute the Mallows CP statistic for all the plausible models and choose only one model. Discuss why you chose this model. For the next questions, only use your selected model from this part.

Subsets for all variables
subsets <- regsubsets(re78~trt+(age+educ+re74+re75)+(black+hisp+marr+nodeg), 
                      method=c("exhaustive"), nbest=3, data=df)

subsets(subsets, statistic="cp", legend=F, main="Mallows CP", col="steelblue4", 
        min.size=1, max.size=8, cex.subsets=1)

I will focus on subset sizes of 3 to 5 to narrow down my decision.

subsets(subsets, statistic="cp", legend=F, main="Mallows CP (Size 3-5)", col="steelblue4", 
        min.size=3, max.size=5, cex.subsets=1)

From this, I could either choose a three variable model with educ, re74, and re75 or a four variable model with the addition of age. For the purpose of this assignment, I am going to assume that the additional decrease in the Mallows CP statistic of adding age will be beneficial and choose the second model to continue:

New Linear Model
lr <- lm(re78~age+educ+re74+re75)
summary(lr)
## 
## Call:
## lm(formula = re78 ~ age + educ + re74 + re75)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65311  -4350   -430   3731 110415 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1249.75094 1157.63314   1.080 0.280429    
## age          -73.26378   20.12027  -3.641 0.000276 ***
## educ         536.04110   71.56916   7.490 9.32e-14 ***
## re74           0.28414    0.02783  10.209  < 2e-16 ***
## re75           0.56864    0.02745  20.715  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10080 on 2670 degrees of freedom
## Multiple R-squared:  0.5845, Adjusted R-squared:  0.5839 
## F-statistic:   939 on 4 and 2670 DF,  p-value: < 2.2e-16

Here our new model looks better than the first one. The coefficients seem reasonable for the most part, and the t values and p values indicate statistical significance. The Adjusted R2 did not change much, however. Also note that age has a negative sign, which may seem a little odd and could be investigated more. Years of education on average associates with a $536 increase in yearly earnings, which seems reasonable.

(d) Plot the residuals vs. the fitted values.

ggplot(data=df, mapping=aes(x=lr$fitted.values, y=lr$residuals)) +
  geom_point(color='#3796db', alpha=0.7) +
  geom_abline(intercept=0, slope=0, color='red') +
  ggtitle('Residuals vs Fitted Values') +
  xlab('Fitted Values') +
  ylab('Residuals')

This plot is concerning, we would ideally like to see these residuals randomly scattered about and not have such a dense cluster. This suggests that we should reconsider a transformation on our model like a log to improve this.

(e) Plot and discuss the VIF plot.

vars <- c('age', 'educ', 're74', 're75')
vif_lr <- vif(lr)
d <- data.frame(vars, vif_lr)

ggplot(d, aes(vars, vif_lr)) + 
  geom_col(fill='#3796db') +
  ylim(0, 4) +
  geom_abline(intercept=4, slope=0, color='black', alpha=0.8, linetype='dashed') +
  ggtitle('Variance Inflation Factors of Regressors') +
  xlab('Regressors') +
  ylab('Variance Inflation Factor')

The Variance Inflation Factor is used to measure how much variance of a regressor is increased due to multicollinearity. Here we can see that all VIF scores are below 4, but re74 and re75 are noticeably higher. This is not too concerning, and intuitively makes sense. It is reasonable to observe that earnings between one year and the next will be correlated if someone remains in the same position. Because the VIF’s are still below 4, I will move on.

(f) Plot and discuss the correlation graph

# Subset dataframe to variables of interest
df_2 <- df[vars]
corr <- melt(cor(df_2))

ggplot(data=corr, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  geom_text(aes(label=round(value, 2))) +
  ggtitle('Correlation of Regressors') +
  xlab('Regressor 1') +
  ylab('Regressor 2')

Our correlation graph gives expected numbers. Real earnings between 1974 and 1975 show a high 0.86 correlation while other variables are lower. We also must note that age and education have a negative correlation. From our regression summary, we had a positive sign on years of education and a negative sign on age, so in that context it makes sense. However I would expect older people to have higher earnings, and thus a positive correlation with age. This analysis could be result of the sampling, a generation difference in employment types, or some other force that could be explored further.

(g) Plot the Cook’s Distance values. Are there any outliers? If so, discuss what you would do with them.

lr_cooks <- cooks.distance(lr)
plot(lr_cooks, ylab="Cook's distance", type='o', main="Cook's Distance", 
     col="#3796db", pch=20, lwd=.25)

Cook’s Distance is used to search for data points that may be influential in a regression estimate that might be considered for deeper investivation. It measures the effect of deleting an observation, so large values indicate a large influence. From our visualization, there are a few outliers, with one in particular being around 0.35. These points could be investigated by observation to see whether the value is due to an error in data entry, or actually just an extreme case. Potential actions on these could be to ignore them, completely remove them, or impute median/mean values to make them fit. (Whichever option one chooses would be disclosed to any audience of this project)

(h) Plot a histogram of the residuals and discuss the results.

(I prefer a density plot for this particular one)

ggplot(data=df) +
  geom_density(mapping=aes(x=lr$residuals), fill='#3796db') + 
  ggtitle('Residuals Density') +
  xlab('Residuals') +
  ylab('Count')

Our residuals look to be somewhat symmetrically distributed around zero. We can notice that it is skinny, thus having a higher kurtosis than Normal distributions. This aligns with infrequent extreme deviations from the mean as opposed to more frequent but modest deviations. The residuals and fitted values scatterplot and the Cook’s Distance plot both align with this distribution shape.

(i) Plot the QQ Normal Plot and discuss the results.

ggplot(data=df) +
  geom_qq(mapping=aes(sample=lr$residuals), alpha=0.6, color='#3796db') +
  ggtitle('QQ Normal Plot') +
  ylab('Sample Quantiles') +
  xlab('Theoretical Quantiles')

The QQ Normal plot is a visual check used to decide whether our data is normally distributed. If our residual distribution was Normal, we would see a roughly straight line. From this, we can observe that values close to zero are straight, but as we get farther away they begin to deviate. These deviations correspond to the extreme values we saw earlier, and that our residuals have more extreme deviations than expected in a Normal distribution.

(j) Plot the observed vs. predicted values, overlay a Lowess smoother, and discuss the results.

ggplot(data=df_2, mapping=aes(lr$fit, re78)) +
  geom_point(alpha=0.7, na.rm=TRUE) +
  geom_smooth(method='loess', color='#3796db') +
  geom_abline(color='red', linetype='dashed', alpha=0.6) +
  xlim(0, 90000) +
  ggtitle('Observed vs Predicted Response') +
  xlab('Predicted Response') +
  ylab('Observed Response') 

From this plot we can reaffirm what has been said earlier. There is a cluster of dense points from 0-40000 where our points lie closely around the Lowess smoother, indicating that this range of values was somewhat well modeled. We also notice in this range the extreme deviations being underpredicted (lying above the line) and some being overpredicted (lying below the line). There is also a band of observed responses at the horizontal line of zero, corresponding to people with zero earnings in 1978. This may be a place to look at the DGP and whether these observations should be kept. From 40000-70000 there are much less observations to be seen, and they exhibit a larger variance than the points to its left. Outside of 70000 our observations sparse and mostly underpredicted.

From this graph and problem overall we can see that the linear model did not do as good a job as possible, and transforming our data will likely improve modeling metrics.

Textbook Questions

Problem 2.2

Download the US GDP quarterly growth rates and the S&P 500 quarterly returns. For both series, compute their descriptive statistics and their histograms. Are these two series contemporaneously correlated? Comment on your findings.

growth_rates <- read.xlsx("Chapter2_exercises_data.xlsx", sheet=1)
summary(growth_rates)
##      date               GRGDP             RETURN       
##  Length:248         Min.   :-2.7075   Min.   :-26.937  
##  Class :character   1st Qu.: 0.3273   1st Qu.: -0.915  
##  Mode  :character   Median : 0.7679   Median :  2.023  
##                     Mean   : 0.7958   Mean   :  1.962  
##                     3rd Qu.: 1.3107   3rd Qu.:  5.753  
##                     Max.   : 3.9343   Max.   : 20.117

From this summary there a notable differences. GRGDP has a smaller range of values [-3, 4] while RETURN has [-27, 20]. GRGDP has a mean closer to zero, and RETURN has a mean around 2. Both series have medians near their means so skewness should not be too much of an issue.

g <- ggplot(data=growth_rates) +
  geom_histogram(mapping=aes(x=GRGDP), fill='#3796db', bins=40) +
  geom_vline(xintercept=0, linetype='dashed') +
  ggtitle('US Growth Rate') +
  xlab(NULL) +
  ylab('Count')

r <- ggplot(data=growth_rates) +
  geom_histogram(mapping=aes(x=RETURN), fill='#3796db', bins=40) +
  ggtitle('S&P 500 Returns') +
  geom_vline(xintercept=0, linetype='dashed') +
  xlab(NULL) +
  ylab('Count')

plot_grid(g, r, ncol=1)

Again, we can see that the US GDP growth rate is near 1, so its overall trend consistently been positive growth. We would expect that stock market returns would also follow this. However, the S&P 500 returns are centered around 0, so while the US total GDP has been mostly rising, the S&P 500 returns have stayed constant. This suggests that GDP growth and stock market returns are not necessarily correlated.

Problem 3.1

Download monthly data on real personal consumption expenditures and real disposable personal income from FRED.

fred <- read.xlsx("Chapter3_exercises_data.xlsx", sheet=1, detectDates=TRUE)[,1:3]

(a) Calculate and plot growth rates. Compare the level of volatility. Can you explain this with your knowledge of macroeconomics? (permanent income model)

fred['rpce_lag'] <- shift(fred$rpce, n=1L, type=c("lag"))
fred['rpce_growth'] <- log(fred$rpce) - log(fred$rpce_lag)

fred['rdpi_lag'] <- shift(fred$rdpi, n=1L, type=c("lag"))
fred['rdpi_growth'] <- log(fred$rdpi) - log(fred$rdpi_lag)

# Re-order
fred <- fred[, c('date', 'rpce', 'rpce_lag', 'rpce_growth', 'rdpi', 'rdpi_lag', 'rdpi_growth')]
p1 <- ggplot(data=fred) +
  geom_line(mapping=aes(date, y=rdpi_growth), color='green', na.rm=TRUE) +  
  geom_abline(intercept=0, slope=0) +
  ylim(-0.05, 0.05) +
  ggtitle('Income Growth Over Time') +
  xlab('Year') +
  ylab('Growth %')

p2 <- ggplot(data=fred) +
  geom_line(mapping=aes(date, y=rpce_growth), color='red', na.rm=TRUE) +
  geom_abline(intercept=0, slope=0) +
  ylim(-0.05, 0.05) +
  ggtitle('Expenditure Growth Over Time') +
  xlab('Year') +
  ylab('Growth %')

plot_grid(p1, p2, ncol=1)

The expenditure growth rate is overall less volatile compared to the income growth rate. The permanent income model relates to this phenomenon by postulating that current and expected future income levels (together lifetime income) drives consumption (expenditure) patterns, but is smoothed over time. So if someone has an increase in income, they will smooth that gain over their lifetime and not spend it proportionally immediately. Thus in this example, one would change their consumption in magnitude less in response to an the income change. This data is evidence of the permanent income hypothesis.

(b) Regress consumption growth on disposable income growth and discuss.

lr_growth <- lm(rpce_growth~rdpi_growth, data=fred)
summary(lr_growth)
## 
## Call:
## lm(formula = rpce_growth ~ rdpi_growth, data = fred)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0304050 -0.0029792  0.0001606  0.0030383  0.0244504 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0022543  0.0002242  10.056  < 2e-16 ***
## rdpi_growth 0.1745175  0.0292014   5.976  3.8e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005317 on 637 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.05309,    Adjusted R-squared:  0.05161 
## F-statistic: 35.72 on 1 and 637 DF,  p-value: 3.799e-09

This summary says that a disposable income growth expects a positive change in consumption. The t values and p values suggest statistical significance, so at the 95% level income growth appears to positively drive expenditure. This R2 score is also very low, meaning that our independent variable of income growth only accounts for 5% of total variation. Our coefficient of rdpi_growth means that a 1% growth in income is expected to give a 0.17% growth in consumption. Because 0.17% < 1%, this aligns with the permanent income hypothesis.

(c) Add a lag of disposable income growth as a parameter to your model and comment on the possiblity of a lag in consumption growth.

fred['rdpi_growth_lag'] <- shift(fred$rdpi_growth, n=1L, type=c('lag'))

lr_growth_lag <- lm(rpce_growth~rdpi_growth+rdpi_growth_lag, data=fred)
summary(lr_growth_lag)
## 
## Call:
## lm(formula = rpce_growth ~ rdpi_growth + rdpi_growth_lag, data = fred)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0300818 -0.0028874 -0.0000051  0.0029768  0.0255088 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.0019889  0.0002405   8.269 7.90e-16 ***
## rdpi_growth     0.1872175  0.0293870   6.371 3.61e-10 ***
## rdpi_growth_lag 0.0828418  0.0293865   2.819  0.00497 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005284 on 635 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.06476,    Adjusted R-squared:  0.06182 
## F-statistic: 21.99 on 2 and 635 DF,  p-value: 5.855e-10

With the lagged consumption growth we do see a small increase in the rdpi_growth coefficient, as well as a positive coefficient for the lag parameter. The actual coefficient of rdpi_growth_lag means that a 1% increase in income in the previous period is expected to give a 0.08% increase in consumption in the current period. The t values and p values of the intercept and the rdpi_growth variable remain to suggest significance, but the lagged parameter is just passing by at the 95% significance level. This finding does not present strong evidence that last periods growth in income has a significant effect on consumption pattern, which coincides with the permanent income hypothesis. Also, the Adjusted R2 rose to 0.062, but this increase is not notably large.

PROBLEM 3.3

Download this data. For each data set, plot the time series, write the exact definition, periodicity, and units. Judge whether the underlying stochastic process may be first and second order weakly stationary. Explain.

Load data and compute means for visualizations
us_real_gdp <- read.xlsx("Chapter3_exercises_data.xlsx", sheet=3, detectDates=TRUE)[,1:2]
us_real_gdp_mean <- mean(us_real_gdp$rgdp)

exchange_rate <- read.xlsx("Chapter3_exercises_data.xlsx", sheet=4, detectDates=TRUE)[,1:2]
exchange_rate_mean <- mean(exchange_rate$jpy_usd)

maturity_yield <- read.xlsx("Chapter3_exercises_data.xlsx", sheet=5, detectDates=TRUE)[,1:2]
maturity_yield_mean <- mean(maturity_yield$CMRate10Yr, na.rm=TRUE)

unemployment <- read.xlsx("Chapter3_exercises_data.xlsx", sheet=6, detectDates=TRUE)[,1:2]
unemployment_mean <- mean(unemployment$unemrate)
Plot Time Series
US Real GDP
ggplot(data=us_real_gdp, mapping=aes(date, rgdp)) +
  geom_line(color='#3796db', lwd=1) +
  geom_hline(yintercept=us_real_gdp_mean, linetype='dashed') +
  ggtitle('US Real GDP') +
  xlab('Year') +
  ylab('RGDP')

Definition: Value of goods and services produced in the US adjusted for inflation.
Periodicity: Quarters, 1947-2012.
Units: USD billions chain weighted.
Stationary: There is a clear upward trend with some small local dips and peaks, so this time series is not first (second) order weakly stationary.

Exchange Rate of Yen vs USD
ggplot(data=exchange_rate, mapping=aes(DATE, jpy_usd)) +
  geom_line(color='#3796db', lwd=1) +
  geom_hline(yintercept=exchange_rate_mean, linetype='dashed') +
  ggtitle('Exchange Rate of Yen vs USD') +
  xlab('Year') +
  ylab('Rate')

Definition: The value of yen (foreign currency) that is equal to 1 USD.
Periodicity: Monthly, 1971-01-04 to 2012-06-01.
Units: Rate of Yen to 1 USD.
Stationary: There is a clear downward trend with some small and moderate local dips and peaks, so this time series is also not first (second) order weakly stationary.

10-year Treasury constant maturity yield
# Removing zero values under assumption that these should be NA.
maturity_yield[maturity_yield==0] <- NA

ggplot(data=maturity_yield, mapping=aes(DATE, CMRate10Yr)) +
  geom_line(color='#3796db', lwd=1) +
  geom_hline(yintercept=maturity_yield_mean, linetype='dashed') +
  ggtitle('10-year Treasury Constant Maturity Yield') +
  xlab('Rate') +
  ylab('Year')
## Warning: Removed 1 rows containing missing values (geom_path).

Definition: Yields on actively traded non-inflation-indexed issues adjusted to constant maturities.
Periodicity: Daily, 1962-01-02 to 2012-06-07.
Units: Rate.
Stationary: This plot is less clear in respect to any trend. Before the mid 1980’s there is an upward trend, but after there is a downward trend. There is does not appear to be a meaningful mean of this series nor is there a seemingly constant degree of variance in the cycles. This series is doubtful to be first (second) order weakly stationary.

US Unemployment Rate
ggplot(data=unemployment, mapping=aes(DATE, unemrate)) +
  geom_line(color='#3796db', lwd=1) +
  geom_hline(yintercept=unemployment_mean, linetype='dashed') +
  ggtitle('US Unemployment Rate') +
  xlab('Year') +
  ylab('Rate')

Definition: The percent of unemployed people over the labor force. The US Labor force includes those 16 years of age and up, not in institutions, not on active military duty, residing in the United States.
Periodicity: Monthly, 1948-01-01 to 2012-05-01.
Units: Rate.
Stationary: This plot has an overall upward trend, but does in fact fluctuate about the mean more than the previous series. It is unclear if this is first order weakly stationary. Since the variances are more obviously not constant, I would be confident enough to at least claim that it is not second order weakly stationary.

PROBLEM 4.3

Compute the ACF and PACF functions of quarterly and annual frequencies of housing prices, interest rates, price growth, and interest rate growth. Comment on the differences in the autocorrelation functions. Which series has stronger time dependency? Compare across annual vs quarterly.

annual <- read_excel("Chapter4_exercises_data.xls", sheet=1)
setnames(annual, c('year', 'price', 'interest'))

quarterly <- read_excel("Chapter4_exercises_data.xls", sheet=2)
setnames(quarterly, c('quarter', 'price', 'interest'))

# Computations
annual['price_lag'] <- shift(annual$price, n=1L, type=c("lag"))
annual['price_growth'] <- log(annual$price) - log(annual$price_lag)
annual['interest_lag'] <- shift(annual$interest, n=1L, type=c("lag"))
annual['interest_growth'] <- log(annual$interest) - log(annual$interest_lag)
# Re-order
annual <- annual[, c('year', 'price', 'price_lag', 'price_growth', 'interest', 'interest_lag', 'interest_growth')]

# Computations
quarterly['price_lag'] <- shift(quarterly$price, n=1L, type=c("lag"))
quarterly['price_growth'] <- log(quarterly$price) - log(quarterly$price_lag)
quarterly['interest_lag'] <- shift(quarterly$interest, n=1L, type=c("lag"))
quarterly['interest_growth'] <- log(quarterly$interest) - log(quarterly$interest_lag)
# Re-order
quarterly <- quarterly[, c('quarter', 'price', 'price_lag', 'price_growth', 'interest', 'interest_lag', 'interest_growth')]
Compute ACF and PACF Functions
Quarterly
acf(quarterly$price, na.action=na.pass, main='Quarterly Price ACF')

pacf(quarterly$price, na.action=na.pass, main='Quarterly Price PACF')

The ACF Price plot shows that each price value at current time t is correlated with itself at time t-1, and as the time series goes further back the autocorrelations become weaker. The PACF Price plot shows a strong correlation between time t and its immediate precedent value at t-1, but does not show correlation when we jump through time periods to t-h where h>1.

acf(quarterly$interest, na.action=na.pass, main='Quarterly Interest ACF')

pacf(quarterly$interest, na.action=na.pass, main='Quarterly Interest PACF')

The ACF and PACF Interest plots show the similar results to above.

acf(quarterly$price_growth, na.action=na.pass, main='Quarterly Price Growth ACF')

pacf(quarterly$price_growth, na.action=na.pass, main='Quarterly Price Growth PACF')

Now, when we attempt to autocorrelate price growth, we observe different patterns. The Price Growth ACF plot suggests some correlation for about 9 periods lagged, and then after this time horizon there is little evidence for correlation. The Price Growth PACF plot shows three lines outside of the confidence interval, and none besides. So comparing price to price growth rate, we can see that price level differs significantly in its correlation across a time series.

acf(quarterly$interest_growth, na.action=na.pass, main='Quarterly Interest Growth ACF')

pacf(quarterly$interest_growth, na.action=na.pass, main='Quarterly Interest Growth PACF')

The Interest Growth ACF plot shows a familiar pattern as before, but not that this is the ACF plot and the pattern here has previously resembled ACF plots. Interest growth is suggested to be less correlated with itself outside of 1 lagged period. Our final visual shows no autocorrelation between interest growth and other random time horizons. Overall, we saw that the price and interest show strong time dependency, and when considered against isolated t-h periods only 1 period lagged seemed to be time dependent. However, with the growth rates themselves we are less confident about time dependency.

Annual
acf(annual$price, na.action=na.pass, main='Annual Price ACF')

pacf(annual$price, na.action=na.pass, main='Annual Price PACF')

acf(annual$interest, na.action=na.pass, main='Annual Interest ACF')

pacf(annual$interest, na.action=na.pass, main='Annual Interest PACF')

acf(annual$price_growth, na.action=na.pass, main='Annual Price Growth ACF')

pacf(annual$price_growth, na.action=na.pass, main='Annual Price Growth PACF')

acf(annual$interest_growth, na.action=na.pass, main='Annual Interest Growth ACF')

pacf(annual$interest_growth, na.action=na.pass, main='Annual Interest Growth PACF')

Comparing annual and quarterly frequencies, we observe mostly the same general patterns. On average the autocorrelation measures are lower and the horizons at which time dependency are strong are shorter in range. This means that the declines in the autocorrelations are happening faster for the annual frequency. Intuitively, this makes sense. As a year is a longer period, it is likely to be less correlated with its lagged self, and the absolute range in which values are time dependent expected to be shorter compared to a higher frequency.